Conversation

@heyihong (Contributor) commented Jul 14, 2025:

What changes were proposed in this pull request?

This PR optimizes the LiteralValueProtoConverter to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

  1. Optimized type inference for arrays and maps: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.

  2. Added needDataType parameter: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.

  3. Updated protobuf documentation: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.

  4. Improved test coverage: Added new test cases for complex nested structures including tuples and maps with array values.

Why are the changes needed?

The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.
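
As an illustration, here is a minimal sketch of the encoding idea (hypothetical names such as `LiteralSketch` and `toArrayLiteral`, not the actual `LiteralValueProtoConverter` code):

```
// Hypothetical sketch: only the first element of an array literal carries
// explicit type information; later elements are emitted with
// needDataType = false and the receiver infers their type from the first.
sealed trait LiteralSketch
case class TypedLiteral(value: Any, dataType: String) extends LiteralSketch
case class UntypedLiteral(value: Any) extends LiteralSketch

def toLiteral(value: Any, dataType: String, needDataType: Boolean): LiteralSketch =
  if (needDataType) TypedLiteral(value, dataType) else UntypedLiteral(value)

def toArrayLiteral(values: Seq[Any], elementType: String): Seq[LiteralSketch] =
  values.zipWithIndex.map { case (v, i) =>
    // Only the first element (i == 0) needs the element type attached.
    toLiteral(v, elementType, needDataType = i == 0)
  }

// toArrayLiteral(Seq(1, 2, 3), "int") returns
//   Seq(TypedLiteral(1, "int"), UntypedLiteral(2), UntypedLiteral(3))
```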

Does this PR introduce any user-facing change?

No. This PR does not introduce any user-facing changes.
The change is backward compatible, and existing Spark Connect clients will continue to work unchanged.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

@heyihong heyihong changed the title [SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional Jul 14, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 62372ff to 2f4c24b on July 16, 2025 14:15
@heyihong heyihong changed the title [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional Jul 16, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 9e7737d to 71814d0 on July 16, 2025 15:28
@heyihong heyihong requested a review from hvanhovell July 16, 2025 19:00
@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from 7e5b478 to da00208 on July 16, 2025 19:57
@HyukjinKwon (Member) commented:
cc @zhengruifeng too

case proto.Expression.Literal.LiteralTypeCase.BOOLEAN =>
  (literal.getBoolean.asInstanceOf[Object], classOf[Boolean])
case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
  val array = literal.getArray
Reviewer (Contributor) commented on this diff:
cc @WeichenXu123 for the ml side

@heyihong heyihong changed the title [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional Jul 17, 2025
@heyihong heyihong requested a review from zhengruifeng July 17, 2025 13:41
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from c819bcb to c579c1c on July 21, 2025 13:56
@heyihong (Contributor, Author) commented Jul 21, 2025:

friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment — I need to finish SPARK-52930 first.

@heyihong (Contributor, Author) commented Sep 1, 2025:
@@ -1,2 +1,2 @@
Project [id#0L, id#0L, 1 AS 1#0, null AS NULL#0, true AS true#0, 68 AS 68#0, 9872 AS 9872#0, -8726532 AS -8726532#0, 7834609328726532 AS 7834609328726532#0L, 2.718281828459045 AS 2.718281828459045#0, -0.8 AS -0.8#0, 89.97620 AS 89.97620#0, 89889.7667231 AS 89889.7667231#0, connect! AS connect!#0, T AS T#0, ABCDEFGHIJ AS ABCDEFGHIJ#0, 0x78797A7B7C7D7E7F808182838485868788898A8B8C8D8E AS X'78797A7B7C7D7E7F808182838485868788898A8B8C8D8E'#0, 0x0806 AS X'0806'#0, [8,6] AS ARRAY(8, 6)#0, null AS NULL#0, 2020-10-10 AS DATE '2020-10-10'#0, 8.997620 AS 8.997620#0, 2023-02-23 04:31:59.808 AS TIMESTAMP '2023-02-23 04:31:59.808'#0, 1969-12-31 16:00:12.345 AS TIMESTAMP '1969-12-31 16:00:12.345'#0, 2023-02-23 20:36:00 AS TIMESTAMP_NTZ '2023-02-23 20:36:00'#0, 2023-02-23 AS DATE '2023-02-23'#0, INTERVAL '0 00:03:20' DAY TO SECOND AS INTERVAL '0 00:03:20' DAY TO SECOND#0, INTERVAL '0-0' YEAR TO MONTH AS INTERVAL '0-0' YEAR TO MONTH#0, 23:59:59.999999999 AS TIME '23:59:59.999999999'#0, 2 months 20 days 0.0001 seconds AS INTERVAL '2 months 20 days 0.0001 seconds'#0, 1 AS 1#0, [1,2,3] AS ARRAY(1, 2, 3)#0, [1,2,3] AS ARRAY(1, 2, 3)#0, map(keys: [a,b], values: [1,2]) AS MAP('a', 1, 'b', 2)#0, [a,2,1.0] AS NAMED_STRUCT('_1', 'a', '_2', 2, '_3', 1.0D)#0, null AS NULL#0, [1] AS ARRAY(1)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, [[1,2,3],[4,5,6],[7,8,9]] AS ARRAY(ARRAY(1, 2, 3), ARRAY(4, 5, 6), ARRAY(7, 8, 9))#0, [keys: [a,b], values: [1,2],keys: [a,b], values: [3,4],keys: [a,b], values: [5,6]] AS ARRAY(MAP('a', 1, 'b', 2), MAP('a', 3, 'b', 4), MAP('a', 5, 'b', 6))#0, map(keys: [1,2], values: [keys: [a,b], values: [1,2],keys: [a,b], values: [3,4]]) AS MAP(1, MAP('a', 1, 'b', 2), 2, MAP('a', 3, 'b', 4))#0, [[1,2,3],keys: [a,b], values: [1,2],[a,keys: [1,2], values: [a,b]]] AS NAMED_STRUCT('_1', ARRAY(1, 2, 3), '_2', MAP('a', 1, 'b', 2), '_3', NAMED_STRUCT('_1', 'a', '_2', MAP(1, 'a', 2, 'b')))#0]
Reviewer (Contributor) asked:
What is the diff here? Why did it change the analyzed plan?

@heyihong (Contributor, Author) replied Sep 4, 2025:
@zhengruifeng The diff is caused by an extra test case I added:

fn.typedLit(
  Seq(
      mutable.LinkedHashMap("a" -> Seq("1", "2"), "b" -> Seq("3", "4")),
      mutable.LinkedHashMap("a" -> Seq("5", "6"), "b" -> Seq("7", "8")),
      mutable.LinkedHashMap("a" -> Seq.empty[String], "b" -> Seq.empty[String])))
This produces the following analyzed-plan entry:
[keys: [a,b], values: [[1,2],[3,4]],keys: [a,b], values: [[5,6],[7,8]],keys: [a,b], values: [[],[]]] AS ARRAY(MAP('a', ARRAY('1', '2'), 'b', ARRAY('3', '4')), MAP('a', ARRAY('5', '6'), 'b', ARRAY('7', '8')), MAP('a', ARRAY(), 'b', ARRAY()))#0
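
To see why the empty value arrays in this test case matter, here is a minimal, hypothetical sketch of recursive type inference (`inferType`, `LitType`, and related names are made up for illustration and are not the Spark Connect converter):

```
// Hypothetical sketch of recursive literal type inference. Empty collections
// yield None: they cannot describe their own element type, which is why the
// first element / first key-value pair has to carry explicit type information
// for the rest to reuse.
sealed trait LitType
case object StringLitType extends LitType
case class ArrayLitType(element: LitType) extends LitType
case class MapLitType(key: LitType, value: LitType) extends LitType

def inferType(value: Any): Option[LitType] = value match {
  case _: String => Some(StringLitType)
  case s: Seq[_] => s.headOption.flatMap(inferType).map(ArrayLitType(_))
  case m: scala.collection.Map[_, _] =>
    m.headOption.flatMap { case (k, v) =>
      for (kt <- inferType(k); vt <- inferType(v)) yield MapLitType(kt, vt)
    }
  case _ => None
}

// inferType(Seq.empty[String]) == None, while the first map in the test case
// above infers MapLitType(StringLitType, ArrayLitType(StringLitType)), which
// the empty value arrays in the last map can then reuse.
```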

@zhengruifeng (Contributor) commented:
@heyihong please resolve the conflicts

@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from e5e59aa to dc6ed7c on September 8, 2025 10:38
@zhengruifeng (Contributor) commented:
merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

These updates are needed to pick up the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

# Remove empty gRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>